NSF PAR Search | NSF Public Access Repository

Fast Interpretable Greedy-Tree Sums

https://doi.org/10.1073/pnas.2310151122

Tan, Yan Shuo; Singh, Chandan; Nasseri, Keyan; Agarwal, Abhineet; Duncan, James; Ronen, Omer; Epland, Matthew; Kornblith, Aaron; Yu, Bin (February 2025, Proceedings of the National Academy of Sciences)

Modern machine learning has achieved impressive prediction performance, but often sacrifices interpretability, a critical consideration in high-stakes domains such as medicine. In such settings, practitioners often use highly interpretable decision tree models, but these suffer from inductive bias against additive structure. To overcome this bias, we propose Fast Interpretable Greedy-Tree Sums (FIGS), which generalizes the Classification and Regression Trees (CART) algorithm to simultaneously grow a flexible number of trees in summation. By combining logical rules with addition, FIGS adapts to additive structure while remaining highly interpretable. Experiments on real-world datasets show FIGS achieves state-of-the-art prediction performance. To demonstrate the usefulness of FIGS in high-stakes domains, we adapt FIGS to learn clinical decision instruments (CDIs), which are tools for guiding decision-making. Specifically, we introduce a variant of FIGS known as Group Probability-Weighted Tree Sums (G-FIGS) that accounts for heterogeneity in medical data. G-FIGS derives CDIs that reflect domain knowledge and enjoy improved specificity (by up to 20% over CART) without sacrificing sensitivity or interpretability. Theoretically, we prove that FIGS learns components of additive models, a property we refer to as disentanglement. Further, we show (under oracle conditions) that tree-sum models leverage disentanglement to generalize more efficiently than single tree models when fitted to additive regression functions. Finally, to avoid overfitting with an unconstrained number of splits, we develop Bagging-FIGS, an ensemble version of FIGS that borrows the variance reduction techniques of random forests. Bagging-FIGS performs competitively with random forests and XGBoost on real-world datasets.
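As an illustration of how a tree-sum model combines logical rules with addition, the following minimal Python sketch evaluates a prediction as the sum of the leaf values reached in each tree. It covers only the prediction step, not the greedy fitting procedure described in the abstract. The dictionary layout, the feature names (`age`, `neck_pain`), the thresholds, and the leaf values are all hypothetical and are not taken from the paper.

```python
# Hypothetical tree-sum model: each tree is a nested dict of rules,
# and each leaf is a plain float contribution. None of these values
# come from the paper; they only illustrate the "rules plus addition" form.

def predict_tree(node, x):
    """Follow one tree's rules down to a leaf and return the leaf value."""
    while isinstance(node, dict):
        go_left = x[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node


def predict_tree_sum(trees, x):
    """Prediction of a tree-sum model: add up one leaf value per tree."""
    return sum(predict_tree(tree, x) for tree in trees)


# Two small trees, each splitting on a different (made-up) feature.
trees = [
    {"feature": "age", "threshold": 50, "left": 0.125, "right": 0.5},
    {"feature": "neck_pain", "threshold": 0.5, "left": 0.0, "right": 0.25},
]

patient = {"age": 62, "neck_pain": 1}
risk = predict_tree_sum(trees, patient)
print(risk)  # 0.75 (0.5 from the first tree + 0.25 from the second)
```

Because each tree here handles a separate feature, the sum can represent an additive effect with two short rules, whereas a single tree would need to repeat the second split under both branches of the first.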
Full Text Available
Robust automated calcification meshing for personalized cardiovascular biomechanics

https://doi.org/10.1038/s41746-024-01202-9

Pak, Daniel H; Liu, Minliang; Kim, Theodore; Ozturk, Caglar; McKay, Raymond; Roche, Ellen T; Gleason, Rudolph; Duncan, James S (December 2024, npj Digital Medicine)

Full Text Available
Experimental study on spray in the atmospheric surface layer by raindrops impacting water surface

https://doi.org/10.1017/jfm.2024.419

Liu, Xinan; Zhang, Xiguang; Zheng, Quanan; Duncan, James H (June 2024, Journal of Fluid Mechanics)

Spray formed by a myriad of secondary droplets generated by the impact of raindrops on a deep-water pool is studied with a laboratory rain facility. Experiments are performed with two rain rates and raindrops fall on the water surface at a nearly constant velocity. The secondary droplets at various heights above the pool's water surface are recorded with a cinematic digital in-line holographic technique that consists of a high-speed camera, a pulsed Nd:YLF laser and associated optics. The experimental results show that in the heat-map scatter plots of radius versus velocity near the water surface of the pool, the droplets are distributed into three regions, corresponding to distinct physical mechanisms of droplet generation. It is found that the diameter distribution of the droplets in the rain field changes with height above the pool's water surface. Both numerical simulation and experimental data reveal that the liquid water content, due to the presence of secondary droplets, in the atmospheric surface layer decreases exponentially with increasing height.
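The exponential decrease with height reported above can be written as c(z) = c0 · exp(−z / h), where c0 is the liquid water content at the surface and h is a decay height. The sketch below shows one common way to estimate such parameters, a least-squares line fit to the logarithm of the measurements. The symbols c0 and h, the function name, and the synthetic data are assumptions for illustration; the abstract does not report numerical values or the fitting method the authors used.

```python
import math


def fit_exponential_decay(heights, values):
    """Fit values ~ c0 * exp(-height / h) by least squares on log(values).

    Returns (c0, h). Assumes all values are positive and that at least
    two distinct heights are given.
    """
    n = len(heights)
    logs = [math.log(v) for v in values]
    mean_z = sum(heights) / n
    mean_log = sum(logs) / n
    # Slope and intercept of the straight line log(c) = intercept + slope * z.
    covariance = sum((z - mean_z) * (l - mean_log) for z, l in zip(heights, logs))
    variance = sum((z - mean_z) ** 2 for z in heights)
    slope = covariance / variance
    intercept = mean_log - slope * mean_z
    return math.exp(intercept), -1.0 / slope


# Synthetic, noise-free data generated from c0 = 2.0 and h = 0.25
# (arbitrary units; not measurements from the study).
heights = [0.1, 0.2, 0.4, 0.8]
values = [2.0 * math.exp(-z / 0.25) for z in heights]

c0, h = fit_exponential_decay(heights, values)
```

On noise-free data the fit recovers the generating parameters up to floating-point error; with real measurements the log transform gives more weight to small values, so a weighted or nonlinear fit may be preferable.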
Full Text Available
Medical image registration via neural fields

https://doi.org/10.1016/j.media.2024.103249

Sun, Shanlin; Han, Kun; You, Chenyu; Tang, Hao; Kong, Deying; Naushad, Junayed; Yan, Xiangyi; Ma, Haoyu; Khosravi, Pooya; Duncan, James S; et al (October 2024, Medical Image Analysis)

Full Text Available
simChef: High-quality data science simulations in R

https://doi.org/10.21105/joss.06156

Duncan, James; Tang, Tiffany; Elliott, Corrine F; Boileau, Philippe; Yu, Bin (March 2024, Journal of Open Source Software)

Full Text Available
Comparison between shadow imaging and in-line holography for measuring droplet size distributions

https://doi.org/10.1007/s00348-023-03633-8

Erinin, Martin A; Néel, Baptiste; Mazzatenta, Megan T; Duncan, James H; Deike, Luc (May 2023, Experiments in Fluids)
Kähler, C; Longmire, E; Westerweel, J (Ed.)
A direct comparison of the droplet size and number measurements using in-line holography and shadow imaging is presented in three dynamically evolving laboratory scale experiments. The two experimental techniques and image processing algorithms used to measure droplet number and radii are described in detail. Droplet radii as low as $r = 14$ µm are measured using in-line holography and $r = 50$ µm using shadow imaging. The droplet radius measurement error is estimated using a calibration target (reticle) and it was found that the holographic technique is able to measure droplet radii more accurately than shadow imaging for droplets with $r \leq 625$ µm. Using the measurements of droplet number and size we quantitatively cross-validate and assess the accuracy of the two measurement techniques. The droplet size distributions, $N(r)$, are measured in all three experiments and are found to agree well between the two measurement techniques. In one of the laboratory experiments, simultaneous measurements of droplets ($r \geq 14$ µm, using holography) and dry aerosols ($0.07 \lessapprox r \lessapprox 2$ µm, using a scanning mobility particle sizer, and $0.15 \lessapprox r \lessapprox 5$ µm, using an optical particle sizer) are reported, one of the first such comparisons to the best of our knowledge. The total number and volume of droplets are found to agree well between both techniques in the three experiments. We demonstrate that a relatively simple shadow imaging technique can be just as reliable when compared to a more sophisticated holographic measurement technique over their common droplet radius measurement range. The agreement in results is shown to be valid over a large range of droplet concentrations, which include experiments with relatively sparse droplet concentrations as low as 0.02 droplets per image.
Advantages and disadvantages for the two techniques are discussed in the context of our results. The main advantages to in-line holography are the greater accuracy in droplet radius measurement, greater spatial resolution, larger depth of field, and the high repetition rate and short pulse duration of the laser light source. In comparison, the main advantages to shadow imaging are the simpler experimental setup, image processing algorithm, and fewer computer resources necessary for image processing. Droplet statistics like number and size are found to be very reliable between the two methods for a large range of droplet densities, $\mathcal{P}_{r>50}$, ranging from $10^{-4} \leq \mathcal{P}_{r>50} \leq 10^{-1}$ cm$^{-3}$, when the two techniques are implemented as shown in this paper.
Full Text Available
Fast Interpretable Greedy-Tree Sums (FIGS)

Tan, Yan Shuo; Singh, Chandan; Nasseri, Keyan; Agarwal, Abhineet; Duncan, James; Ronen, Omer; Epland, Matthew; Kornblith, Aaron; Yu, Bin (July 2023, arXiv.org)

Modern machine learning has achieved impressive prediction performance, but often sacrifices interpretability, a critical consideration in high-stakes domains such as medicine. In such settings, practitioners often use highly interpretable decision tree models, but these suffer from inductive bias against additive structure. To overcome this bias, we propose Fast Interpretable Greedy-Tree Sums (FIGS), which generalizes the CART algorithm to simultaneously grow a flexible number of trees in summation. By combining logical rules with addition, FIGS is able to adapt to additive structure while remaining highly interpretable. Extensive experiments on real-world datasets show that FIGS achieves state-of-the-art prediction performance. To demonstrate the usefulness of FIGS in high-stakes domains, we adapt FIGS to learn clinical decision instruments (CDIs), which are tools for guiding clinical decision-making. Specifically, we introduce a variant of FIGS known as G-FIGS that accounts for the heterogeneity in medical data. G-FIGS derives CDIs that reflect domain knowledge and enjoy improved specificity (by up to 20% over CART) without sacrificing sensitivity or interpretability. To provide further insight into FIGS, we prove that FIGS learns components of additive models, a property we refer to as disentanglement. Further, we show (under oracle conditions) that unconstrained tree-sum models leverage disentanglement to generalize more efficiently than single decision tree models when fitted to additive regression functions. Finally, to avoid overfitting with an unconstrained number of splits, we develop Bagging-FIGS, an ensemble version of FIGS that borrows the variance reduction techniques of random forests. Bagging-FIGS enjoys competitive performance with random forests and XGBoost on real-world datasets.
Full Text Available
Group Probability-Weighted Tree Sums for Interpretable Modeling of Heterogeneous Data

Nasseri, Keyan; Singh, Chandan; Duncan, James; Kornblith, Aaron; Yu, Bin (May 2022, arXiv.org)

Machine learning in high-stakes domains, such as healthcare, faces two critical challenges: (1) generalizing to diverse data distributions given limited training data while (2) maintaining interpretability. To address these challenges, we propose an instance-weighted tree-sum method that effectively pools data across diverse groups to output a concise, rule-based model. Given distinct groups of instances in a dataset (e.g., medical patients grouped by age or treatment site), our method first estimates group membership probabilities for each instance. Then, it uses these estimates as instance weights in FIGS (Tan et al., 2022), to grow a set of decision trees whose values sum to the final prediction. We call this new method Group Probability-Weighted Tree Sums (G-FIGS). G-FIGS achieves state-of-the-art prediction performance on important clinical datasets; e.g., holding the level of sensitivity fixed at 92%, G-FIGS increases specificity for identifying cervical spine injury (CSI) by up to 10% over CART and up to 3% over FIGS alone, with larger gains at higher sensitivity levels. By keeping the total number of rules below 16 in FIGS, the final models remain interpretable, and we find that their rules match medical domain expertise. All code, data, and models are released on GitHub.
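The comparison above reports specificity at a fixed level of sensitivity. The following minimal sketch shows one way such a number can be computed from predicted scores: scan the candidate decision thresholds, keep those that reach the required sensitivity, and report the best specificity among them. The function name and the toy labels and scores are hypothetical, and this is not the authors' evaluation code.

```python
def specificity_at_sensitivity(y_true, scores, min_sensitivity):
    """Best specificity over thresholds whose sensitivity >= min_sensitivity.

    y_true holds 0/1 labels and scores holds predicted risk scores.
    An instance is predicted positive when its score >= threshold.
    Assumes both classes are present in y_true.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best = 0.0
    for threshold in sorted(set(scores)):
        predicted = [s >= threshold for s in scores]
        true_pos = sum(1 for p, y in zip(predicted, y_true) if p and y == 1)
        true_neg = sum(1 for p, y in zip(predicted, y_true) if not p and y == 0)
        sensitivity = true_pos / positives
        specificity = true_neg / negatives
        if sensitivity >= min_sensitivity:
            best = max(best, specificity)
    return best


# Toy data (made up): three positive cases and four negative cases.
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1]

print(specificity_at_sensitivity(y_true, scores, 1.0))  # 0.75
```

Requiring every positive case to be caught forces the threshold down to 0.4 in this toy example, which misclassifies one of the four negatives and gives a specificity of 0.75. Relaxing the sensitivity requirement lets the threshold rise and the specificity improve.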
Full Text Available
VeridicalFlow: a Python package for building trustworthy data science pipelines with PCS

https://doi.org/10.21105/joss.03895

Duncan, James; Kapoor, Rush; Agarwal, Abhineet; Singh, Chandan; Yu, Bin (January 2022, Journal of Open Source Software)

Full Text Available
A Mixing Time Lower Bound for a Simplified Version of BART

Ronen, Omer; Saarinen, Theo; Tan, Yan Shuo; Duncan, James; Yu, Bin (January 2022, arXiv.org)

Full Text Available
